Data analysis in sport has been on the rise for several years, to the point that in some cases, such as American football or baseball, it has become so important that it directly influences how games unfold.
The world of football, however, seems somewhat reluctant to incorporate this way of working into its day-to-day, although in recent years specialised companies such as Olocip have appeared and are starting to offer real solutions to issues that previously went unnoticed.
To get a feel for this field we have selected a dataset (kaggle - La Liga Match Data) with data on every match of the Spanish first division over seven seasons (2014 to 2020, 2660 matches in total).
Our first goal is to analyse the data in depth to see how the different events of a football match are distributed, and then to determine which of them have the greatest influence on the final result.
Finally, once we know which variables most influence a team's victory, we will check whether the outcome of a future match can be predicted.
library(dplyr)
library(tidyr)
library(knitr)
library(readr)
library(ggplot2)
library(cowplot)
library(GGally)
library(kableExtra)
library(plotrix)
library(ggcorrplot)
In this first stage we load the downloaded dataset and take a quick look at the variables it contains and their characteristics. The first 10 rows (the first matchday) are also shown.
laliga <- read_csv("combined_data_laliga.csv")
head(laliga, 10) %>%
kbl() %>%
kable_material(c("striped", "hover")) %>%
scroll_box(width = "100%", height = "350px")
| …1 | Home Team | Away Team | Score | Half Time Score | Match Excitement | Home Team Rating | Away Team Rating | Home Team Possession % | Away Team Possession % | Home Team Off Target Shots | Home Team On Target Shots | Home Team Total Shots | Home Team Blocked Shots | Home Team Corners | Home Team Throw Ins | Home Team Pass Success % | Home Team Aerials Won | Home Team Clearances | Home Team Fouls | Home Team Yellow Cards | Home Team Second Yellow Cards | Home Team Red Cards | Away Team Off Target Shots | Away Team On Target Shots | Away Team Total Shots | Away Team Blocked Shots | Away Team Corners | Away Team Throw Ins | Away Team Pass Success % | Away Team Aerials Won | Away Team Clearances | Away Team Fouls | Away Team Yellow Cards | Away Team Second Yellow Cards | Away Team Red Cards | Home Team Goals Scored | Away Team Goals Scored | Home Team Goals Conceeded | Away Team Goals Conceeded | year |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | MÁLAGA | ATHLETIC | 1-0 | 1-0 | 4.4 | 6.0 | 5.7 | 40 | 60 | 5 | 3 | 12 | 4 | 5 | 13 | 69 | 11 | 16 | 13 | 3 | 0 | 2 | 5 | 5 | 12 | 2 | 4 | 22 | 84 | 17 | 14 | 9 | 3 | 0 | 0 | 1 | 0 | 0 | 1 | 2014 |
| 1 | SEVILLA FC | VALENCIA | 1-1 | 1-0 | 4.7 | 6.5 | 6.8 | 47 | 53 | 4 | 3 | 12 | 5 | 3 | 20 | 76 | 13 | 16 | 23 | 4 | 0 | 0 | 5 | 1 | 11 | 5 | 3 | 27 | 79 | 14 | 16 | 8 | 2 | 0 | 1 | 1 | 1 | 1 | 1 | 2014 |
| 2 | GRANADA | DEPORTIVO | 2-1 | 0-1 | 4.6 | 7.2 | 5.9 | 53 | 47 | 6 | 3 | 10 | 1 | 5 | 25 | 79 | 20 | 21 | 13 | 1 | 0 | 0 | 1 | 1 | 8 | 6 | 3 | 30 | 76 | 10 | 14 | 26 | 3 | 0 | 0 | 2 | 1 | 1 | 2 | 2014 |
| 3 | ALMERÍA | ESPANYOL | 1-1 | 0-0 | 5.6 | 6.9 | 5.5 | 56 | 44 | 7 | 6 | 19 | 6 | 11 | 26 | 81 | 19 | 25 | 8 | 3 | 0 | 0 | 6 | 2 | 12 | 4 | 7 | 19 | 70 | 11 | 20 | 9 | 3 | 1 | 0 | 1 | 1 | 1 | 1 | 2014 |
| 4 | EIBAR | REAL SOCIEDAD | 1-0 | 1-0 | 3.7 | 6.5 | 5.9 | 41 | 59 | 5 | 5 | 12 | 2 | 5 | 28 | 60 | 29 | 17 | 13 | 4 | 0 | 0 | 7 | 4 | 19 | 8 | 6 | 36 | 75 | 25 | 20 | 14 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 2014 |
| 5 | BARCELONA | ELCHE | 3-0 | 1-0 | 5.1 | 8.0 | 5.3 | 72 | 28 | 5 | 6 | 12 | 1 | 3 | 23 | 93 | 2 | 2 | 11 | 0 | 0 | 1 | 3 | 0 | 3 | 0 | 1 | 19 | 81 | 4 | 17 | 13 | 1 | 0 | 0 | 3 | 0 | 0 | 3 | 2014 |
| 6 | CELTA | GETAFE | 3-1 | 1-0 | 7.0 | 7.3 | 5.5 | 59 | 41 | 4 | 9 | 20 | 7 | 4 | 21 | 82 | 27 | 15 | 19 | 4 | 0 | 0 | 8 | 3 | 12 | 1 | 4 | 19 | 74 | 22 | 7 | 15 | 1 | 0 | 0 | 3 | 1 | 1 | 3 | 2014 |
| 7 | LEVANTE | VILLARREAL | 0-2 | 0-0 | 4.7 | 5.6 | 7.3 | 53 | 47 | 4 | 2 | 9 | 3 | 3 | 16 | 80 | 19 | 12 | 11 | 1 | 0 | 0 | 5 | 6 | 12 | 1 | 4 | 14 | 78 | 21 | 18 | 14 | 3 | 0 | 0 | 0 | 2 | 2 | 0 | 2014 |
| 8 | REAL MADRID | CÓRDOBA | 2-0 | 1-0 | 4.7 | 6.9 | 5.5 | 63 | 37 | 3 | 8 | 14 | 3 | 8 | 25 | 88 | 23 | 13 | 9 | 1 | 0 | 0 | 6 | 2 | 8 | 0 | 5 | 18 | 79 | 15 | 27 | 13 | 2 | 0 | 0 | 2 | 0 | 0 | 2 | 2014 |
| 9 | RAYO VALLECANO | ATLETICO MADRID | 0-0 | 0-0 | 2.1 | 6.9 | 5.9 | 59 | 41 | 5 | 2 | 8 | 1 | 2 | 23 | 79 | 21 | 20 | 13 | 2 | 0 | 0 | 3 | 2 | 9 | 4 | 4 | 34 | 68 | 26 | 15 | 17 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 2014 |
summary(laliga)
## ...1 Home Team Away Team Score
## Min. : 0.0 Length:2660 Length:2660 Length:2660
## 1st Qu.: 664.8 Class :character Class :character Class :character
## Median :1329.5 Mode :character Mode :character Mode :character
## Mean :1329.5
## 3rd Qu.:1994.2
## Max. :2659.0
##
## Half Time Score Match Excitement Home Team Rating Away Team Rating
## Length:2660 Min. : 1.400 Min. : 2.800 Min. : 3.000
## Class :character 1st Qu.: 3.900 1st Qu.: 5.700 1st Qu.: 5.400
## Mode :character Median : 5.100 Median : 6.400 Median : 6.000
## Mean : 5.249 Mean : 6.369 Mean : 6.045
## 3rd Qu.: 6.300 3rd Qu.: 6.900 3rd Qu.: 6.600
## Max. :10.000 Max. :10.000 Max. :10.000
##
## Home Team Possession % Away Team Possession % Home Team Off Target Shots
## Min. :18.00 Min. :17.00 Min. : 0.000
## 1st Qu.:44.00 1st Qu.:41.00 1st Qu.: 4.000
## Median :52.00 Median :48.00 Median : 5.000
## Mean :51.54 Mean :48.46 Mean : 5.533
## 3rd Qu.:59.00 3rd Qu.:56.00 3rd Qu.: 7.000
## Max. :83.00 Max. :82.00 Max. :18.000
## NA's :1
## Home Team On Target Shots Home Team Total Shots Home Team Blocked Shots
## Min. : 0.000 Min. : 2.00 Min. : 0.000
## 1st Qu.: 3.000 1st Qu.:10.00 1st Qu.: 1.000
## Median : 4.000 Median :13.00 Median : 3.000
## Mean : 4.649 Mean :13.08 Mean : 2.913
## 3rd Qu.: 6.000 3rd Qu.:16.00 3rd Qu.: 4.000
## Max. :17.000 Max. :33.00 Max. :12.000
## NA's :1
## Home Team Corners Home Team Throw Ins Home Team Pass Success %
## Min. : 0.000 Min. : 5.0 Min. :46.00
## 1st Qu.: 3.000 1st Qu.:18.0 1st Qu.:73.00
## Median : 5.000 Median :22.0 Median :78.00
## Mean : 5.395 Mean :22.5 Mean :77.71
## 3rd Qu.: 7.000 3rd Qu.:27.0 3rd Qu.:84.00
## Max. :20.000 Max. :49.0 Max. :93.00
##
## Home Team Aerials Won Home Team Clearances Home Team Fouls
## Min. : 2.0 Min. : 1.00 Min. : 1.00
## 1st Qu.:12.0 1st Qu.:12.00 1st Qu.:11.00
## Median :16.0 Median :17.00 Median :14.00
## Mean :17.1 Mean :17.71 Mean :13.75
## 3rd Qu.:22.0 3rd Qu.:22.00 3rd Qu.:16.00
## Max. :52.0 Max. :61.00 Max. :33.00
## NA's :1
## Home Team Yellow Cards Home Team Second Yellow Cards Home Team Red Cards
## Min. :0.000 Min. :0.00000 Min. :0.00000
## 1st Qu.:1.000 1st Qu.:0.00000 1st Qu.:0.00000
## Median :2.000 Median :0.00000 Median :0.00000
## Mean :2.417 Mean :0.05604 Mean :0.04699
## 3rd Qu.:3.000 3rd Qu.:0.00000 3rd Qu.:0.00000
## Max. :8.000 Max. :2.00000 Max. :2.00000
## NA's :1
## Away Team Off Target Shots Away Team On Target Shots Away Team Total Shots
## Min. : 0.000 Min. : 0.000 Min. : 0.00
## 1st Qu.: 3.000 1st Qu.: 2.000 1st Qu.: 7.00
## Median : 4.000 Median : 3.000 Median :10.00
## Mean : 4.319 Mean : 3.669 Mean :10.36
## 3rd Qu.: 6.000 3rd Qu.: 5.000 3rd Qu.:13.00
## Max. :14.000 Max. :13.000 Max. :28.00
## NA's :1
## Away Team Blocked Shots Away Team Corners Away Team Throw Ins
## Min. : 0.000 Min. : 0.000 Min. : 3.00
## 1st Qu.: 1.000 1st Qu.: 2.000 1st Qu.:17.00
## Median : 2.000 Median : 4.000 Median :21.00
## Mean : 2.391 Mean : 4.215 Mean :21.48
## 3rd Qu.: 3.000 3rd Qu.: 6.000 3rd Qu.:26.00
## Max. :13.000 Max. :32.000 Max. :45.00
## NA's :1 NA's :1 NA's :1
## Away Team Pass Success % Away Team Aerials Won Away Team Clearances
## Min. :41.0 Min. : 0.00 Min. : 1.00
## 1st Qu.:71.0 1st Qu.:11.00 1st Qu.:14.00
## Median :77.0 Median :16.00 Median :20.00
## Mean :75.9 Mean :16.75 Mean :21.43
## 3rd Qu.:82.0 3rd Qu.:21.00 3rd Qu.:27.00
## Max. :93.0 Max. :53.00 Max. :63.00
## NA's :2 NA's :1
## Away Team Fouls Away Team Yellow Cards Away Team Second Yellow Cards
## Min. : 0.00 Min. :0.000 Min. :0.00000
## 1st Qu.:11.00 1st Qu.:2.000 1st Qu.:0.00000
## Median :13.00 Median :3.000 Median :0.00000
## Mean :13.77 Mean :2.641 Mean :0.07744
## 3rd Qu.:16.00 3rd Qu.:4.000 3rd Qu.:0.00000
## Max. :30.00 Max. :8.000 Max. :2.00000
##
## Away Team Red Cards Home Team Goals Scored Away Team Goals Scored
## Min. :0.00000 Min. : 0.000 Min. :0.000
## 1st Qu.:0.00000 1st Qu.: 1.000 1st Qu.:0.000
## Median :0.00000 Median : 1.000 Median :1.000
## Mean :0.04887 Mean : 1.518 Mean :1.141
## 3rd Qu.:0.00000 3rd Qu.: 2.000 3rd Qu.:2.000
## Max. :2.00000 Max. :10.000 Max. :8.000
##
## Home Team Goals Conceeded Away Team Goals Conceeded year
## Min. :0.000 Min. : 0.000 Min. :2014
## 1st Qu.:0.000 1st Qu.: 1.000 1st Qu.:2015
## Median :1.000 Median : 1.000 Median :2017
## Mean :1.141 Mean : 1.518 Mean :2017
## 3rd Qu.:2.000 3rd Qu.: 2.000 3rd Qu.:2019
## Max. :8.000 Max. :10.000 Max. :2020
##
The dataset contains 2660 observations (one match each) and 40 variables, not counting the row-index column: 4 qualitative and 36 quantitative, mostly discrete counts. Most of the quantitative variables (34) are match actions recorded twice, once for each team (home and away).
As for missing values (NA), the original dataset had none. However, a few have been introduced at random so that they can be handled later.
Below is a description of each variable:
Home Team: home team
Away Team: away team
Score: final score (e.g. 2-1)
Half Time Score: score at half time (e.g. 1-1)
Match Excitement: match excitement (subjective)
Home Team Rating: home team rating (subjective)
Away Team Rating: away team rating (subjective)
Home Team Possession %: home team possession (%)
Away Team Possession %: away team possession (%)
Home Team Off Target Shots: home team shots off target
Home Team On Target Shots: home team shots on target
Home Team Total Shots: home team total shots
Home Team Blocked Shots: home team blocked shots
Home Team Corners: home team corners
Home Team Throw Ins: home team throw-ins
Home Team Pass Success %: home team pass success rate (%)
Home Team Aerials Won: home team aerial duels won
Home Team Clearances: home team clearances
Home Team Fouls: home team fouls
Home Team Yellow Cards: home team yellow cards
Home Team Second Yellow Cards: home team second yellow cards
Home Team Red Cards: home team red cards
Away Team Off Target Shots: away team shots off target
Away Team On Target Shots: away team shots on target
Away Team Total Shots: away team total shots
Away Team Blocked Shots: away team blocked shots
Away Team Corners: away team corners
Away Team Throw Ins: away team throw-ins
Away Team Pass Success %: away team pass success rate (%)
Away Team Aerials Won: away team aerial duels won
Away Team Clearances: away team clearances
Away Team Fouls: away team fouls
Away Team Yellow Cards: away team yellow cards
Away Team Second Yellow Cards: away team second yellow cards
Away Team Red Cards: away team red cards
Home Team Goals Scored: goals scored by the home team
Away Team Goals Scored: goals scored by the away team
Home Team Goals Conceeded: goals conceded by the home team (equal to the away team's goals)
Away Team Goals Conceeded: goals conceded by the away team (equal to the home team's goals)
year: year
After this first look, the dataset will be cleaned:
First, we rename all variables to remove the spaces (which get in the way when handling them), replacing them with underscores.
The variables Match Excitement, Home Team Rating and Away Team Rating will be removed, since they are subjective assessments that we are not interested in analysing.
Both Home Team Goals Conceeded and Away Team Goals Conceeded are redundant, since they are already captured by the columns of goals scored by each team, so we remove them too.
The Score variable (stored as text) will be replaced by a categorical variable with the result in quiniela (football pools) notation: 1 for a home win, X for a draw and 2 for an away win. The half-time result, Half Time Score, will also be removed, since our aim is to predict the result before kick-off, not halfway through the match.
The variables identifying the teams in each match (Home Team and Away Team) will be converted to categorical variables so that they display correctly later on.
Finally, with a view to future analyses using each team's league position at the time of the match, we define three new variables: the matchday number (we have checked that the dataset is in chronological order) and the points earned by each team (0, 1 or 3).
# Clean the column names, replacing spaces with underscores
names(laliga) <- gsub(" ", "_", names(laliga))
# Drop the variables we are not interested in
laliga$Score <- NULL
laliga$`Half_Time_Score` <- NULL
laliga$`Match_Excitement` <- NULL
laliga$`Home_Team_Rating` <- NULL
laliga$`Away_Team_Rating` <- NULL
laliga$`Home_Team_Goals_Conceeded` <- NULL
laliga$`Away_Team_Goals_Conceeded` <- NULL
# Create the fields to be studied later on
laliga <- mutate(laliga, score = ifelse(Home_Team_Goals_Scored - Away_Team_Goals_Scored > 0, "1",
                                 ifelse(Home_Team_Goals_Scored - Away_Team_Goals_Scored < 0, "2", "x")))
# Matchday number: 10 matches per matchday, 38 matchdays per season
laliga <- mutate(laliga, jornada = ceiling(row_number() / 10)) %>%
  mutate(jornada = (jornada - 1) %% 38 + 1)
# score is still a character column here, so compare it against strings
laliga <- mutate(laliga, home_points = case_when(score == "1" ~ 3,
                                                 score == "x" ~ 1,
                                                 score == "2" ~ 0))
laliga <- mutate(laliga, away_points = case_when(score == "1" ~ 0,
                                                 score == "x" ~ 1,
                                                 score == "2" ~ 3))
# Convert to factors
laliga$Home_Team <- as.factor(laliga$Home_Team)
laliga$Away_Team <- as.factor(laliga$Away_Team)
laliga$score <- factor(laliga$score, levels = c("1", "x", "2"))
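As a quick sanity check of the derived fields, the following minimal sketch applies the same rules to a handful of made-up scores. It uses base R only; the goal vectors and the names result, home_pts, away_pts and matchday are illustrative and not part of the dataset. Like the code above, it assumes 10 matches per matchday and 38 matchdays per season.

```r
# Made-up goals for four matches (illustration only)
home_goals <- c(1, 1, 0, 3)
away_goals <- c(0, 1, 2, 3)

# 1 = home win, x = draw, 2 = away win
goal_diff <- sign(home_goals - away_goals)            # -1, 0 or 1
result <- factor(c("2", "x", "1")[goal_diff + 2], levels = c("1", "x", "2"))

# Points per side: 3 for a win, 1 for a draw, 0 for a defeat
home_pts <- unname(c("1" = 3, "x" = 1, "2" = 0)[as.character(result)])
away_pts <- unname(c("1" = 0, "x" = 1, "2" = 3)[as.character(result)])

# Matchday from the row position in a chronologically ordered dataset
matchday <- function(row) (ceiling(row / 10) - 1) %% 38 + 1
matchday(c(1, 10, 11, 380, 381))
```

Row 381 is the first match of the second season, so its matchday wraps back to 1.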
Had any of the NAs appeared in the categorical variables (teams and results), we would have handled them by looking up the specific match in any source (Google, for example): knowing the matchday and the season, it would take only a few seconds to find the values, and we would avoid dropping those rows.
In this case, however, they appear only in the following numeric variables: Home_Team_Possession_%, Home_Team_On_Target_Shots, Home_Team_Clearances, Home_Team_Second_Yellow_Cards, Away_Team_Off_Target_Shots, Away_Team_Blocked_Shots, Away_Team_Corners, Away_Team_Throw_Ins, Away_Team_Pass_Success_% (two values) and Away_Team_Aerials_Won.
Since they are so few relative to the size of the dataset, and since it is preferable to keep these rows (the seemingly most relevant fields, such as goals and teams, are present), we chose to replace these NAs with the median of the remaining values of each variable (the code below uses the rounded mean instead for Away_Team_Pass_Success_% and Away_Team_Aerials_Won).
We are aware that imputing missing data with a measure of central tendency (such as the median used here) is discouraged because it tends to bias the estimators. In our case, however, given how few NAs there are, we judged that a more complex imputation method, such as a regression, was not worth the effort.
# Columns whose NAs are replaced by the median of the observed values
median_cols <- c("Home_Team_Possession_%", "Home_Team_On_Target_Shots",
                 "Home_Team_Clearances", "Home_Team_Second_Yellow_Cards",
                 "Away_Team_Off_Target_Shots", "Away_Team_Blocked_Shots",
                 "Away_Team_Corners", "Away_Team_Throw_Ins")
for (col in median_cols) {
  laliga[[col]][is.na(laliga[[col]])] <- median(laliga[[col]], na.rm = TRUE)
}
# Columns whose NAs are replaced by the rounded mean of the observed values
mean_cols <- c("Away_Team_Pass_Success_%", "Away_Team_Aerials_Won")
for (col in mean_cols) {
  laliga[[col]][is.na(laliga[[col]])] <- round(mean(laliga[[col]], na.rm = TRUE), 0)
}
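The steps discussed above (locating the NAs and imputing a central value) can be sketched on a small made-up data frame. This is a minimal base R illustration, not the code used in the analysis; toy and impute_median are hypothetical names. The last two lines show the drawback mentioned in the text: filling gaps with a central value shrinks the spread of the variable.

```r
# Made-up data frame with a few missing values (illustration only)
toy <- data.frame(
  shots   = c(4, 6, NA, 5, 9),
  corners = c(3, NA, 5, 7, NA),
  team    = c("A", "B", "C", "D", "E")
)

# Number of NAs per column, keeping only the columns that have any
na_count <- colSums(is.na(toy))
na_count[na_count > 0]

# Hypothetical helper: replace the NAs of a numeric vector with its median
impute_median <- function(x) {
  x[is.na(x)] <- median(x, na.rm = TRUE)
  x
}

# Apply the helper to every numeric column
num_cols <- names(toy)[sapply(toy, is.numeric)]
imputed <- toy
imputed[num_cols] <- lapply(toy[num_cols], impute_median)

# The standard deviation drops after imputation
sd(toy$corners, na.rm = TRUE)
sd(imputed$corners)
```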
To finish preparing the data, we split the dataset into two parts: a training set (train), which we will work with throughout the analysis, and a test set (test), which will be used to check the performance of the final model. For this split we thought it appropriate to hold out the last two seasons (approximately 30% of the total), in case we observed any time-dependent trend that could help the model's predictions.
test <- filter(laliga, year > 2018)
train <- filter(laliga, year <= 2018)
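The share of data held out by this split can be checked with a short sketch. It assumes every season contributes the same number of matches (380, that is, 2660 rows spread over the seven year labels 2014 to 2020); the vector years is a stand-in for laliga$year, not the real column.

```r
# Stand-in for laliga$year, assuming 380 matches per season
years <- rep(2014:2020, each = 380)

# Same rule as the split above: seasons after 2018 go to the test set
is_test <- years > 2018
c(train = sum(!is_test), test = sum(is_test))

# Share of matches in the test set, in percent
test_share <- round(mean(is_test) * 100, 1)
test_share
```

Under that assumption the test set holds two of seven seasons, a little under the 30% quoted above.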
With the dataset ready, we start with the analysis of the qualitative variables. In our case there are only three of this type (home team, away team and result), and we begin with the first two:
# Matches played by each team (more matches means more seasons in the first division)
data.frame(Equipo = rownames(table(train$Home_Team) + table(train$Away_Team)),
Partidos = as.numeric(table(train$Home_Team) + table(train$Away_Team))) %>%
arrange(desc(Partidos)) %>%
kbl() %>%
kable_material(c("striped", "hover")) %>%
scroll_box(width = "100%", height = "350px")
| Equipo | Partidos |
|---|---|
| ATHLETIC | 190 |
| ATLETICO MADRID | 190 |
| BARCELONA | 190 |
| CELTA | 190 |
| EIBAR | 190 |
| ESPANYOL | 190 |
| REAL MADRID | 190 |
| REAL SOCIEDAD | 190 |
| SEVILLA FC | 190 |
| VALENCIA | 190 |
| VILLARREAL | 190 |
| DEPORTIVO | 152 |
| GETAFE | 152 |
| LEVANTE | 152 |
| MÁLAGA | 152 |
| REAL BETIS | 152 |
| ALAVÉS | 114 |
| GRANADA | 114 |
| LAS PALMAS | 114 |
| LEGANÉS | 114 |
| RAYO VALLECANO | 114 |
| GIJÓN | 76 |
| GIRONA | 76 |
| ALMERÍA | 38 |
| CÓRDOBA | 38 |
| ELCHE | 38 |
| HUESCA | 38 |
| OSASUNA | 38 |
| VALLADOLID | 38 |
| CÁDIZ CF | 0 |
| MALLORCA | 0 |
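The teams listed with 0 matches do not play in any training season; they show up because the factor levels were created on the full dataset. One possible way to avoid such empty rows, sketched here on made-up fixtures with base R, is to count home and away appearances together and drop unused levels first. The names levels_all, home, away and games are illustrative.

```r
# Made-up fixtures; "D" is a factor level with no matches in this subset
levels_all <- c("A", "B", "C", "D")
home <- factor(c("A", "B", "C", "A"), levels = levels_all)
away <- factor(c("B", "C", "A", "C"), levels = levels_all)

# Pool home and away appearances and drop the levels that never occur
teams <- droplevels(factor(c(as.character(home), as.character(away)),
                           levels = levels_all))
games <- sort(table(teams), decreasing = TRUE)

# Same two-column layout as the table above
data.frame(Equipo = names(games), Partidos = as.vector(games))
```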
ggplot(train, aes(Home_Team)) +
geom_bar(fill = "darkgreen") +
coord_flip() +
labs(x = "Equipo", y = "Partidos", title = "Partidos jugados como local") +
theme(plot.title = element_text(hjust = 0.5))
Both the numerical summary and the plot show that some teams have played every possible match while others have not, which tells us which teams have spent at least one of these seasons in the second division.
As for results, we can see below that the home team wins almost half of the matches (46.42%), against 28.95% away wins and 24.63% draws.
# Number and percentage of home wins, draws and away wins
table(train$score)
##
## 1 x 2
## 882 468 550
round(prop.table(table(train$score))*100, 2)
##
## 1 x 2
## 46.42 24.63 28.95
ggplot(train, aes(score, fill = score)) +
geom_bar() +
labs(x = "Resultado", y = "Partidos", title = "Nº de victorias locales, empates y victorias visitantes") +
theme(plot.title = element_text(hjust = 0.5))
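Whether the home advantage seen in these counts could be due to chance can be checked with a simple test. As a sketch (an addition, not part of the original analysis), the exact binomial test below takes the counts printed above, sets draws aside, and asks whether a home win and an away win are equally likely among matches with a winner. It treats matches as independent, which is an assumption.

```r
# Counts copied from table(train$score) above
home_wins <- 882
draws     <- 468
away_wins <- 550
total     <- home_wins + draws + away_wins   # all training matches

# Among matches with a winner, is a home win as likely as an away win?
bt <- binom.test(home_wins, home_wins + away_wins, p = 0.5)

# Share of home wins among decided matches, in percent, and the p-value
round(unname(bt$estimate) * 100, 2)
bt$p.value
```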
Having analysed both variables separately, we now look at them jointly, showing the number and the proportion of matches each team wins, draws and loses as home team and as away team. The columns follow the 1/x/2 coding (home win, draw, away win), so in the away tables a team's wins are in column 2 and its defeats in column 1.
# Matches won, drawn and lost by each team at home
with(train, table(Home_Team, score)) %>%
kbl() %>%
kable_material(c("striped", "hover")) %>%
scroll_box(width = "100%", height = "350px")
| | 1 | x | 2 |
|---|---|---|---|
| ALAVÉS | 23 | 17 | 17 |
| ALMERÍA | 3 | 7 | 9 |
| ATHLETIC | 47 | 30 | 18 |
| ATLETICO MADRID | 70 | 17 | 8 |
| BARCELONA | 78 | 11 | 6 |
| CÁDIZ CF | 0 | 0 | 0 |
| CELTA | 42 | 26 | 27 |
| CÓRDOBA | 1 | 6 | 12 |
| DEPORTIVO | 20 | 25 | 31 |
| EIBAR | 40 | 21 | 34 |
| ELCHE | 6 | 3 | 10 |
| ESPANYOL | 44 | 25 | 26 |
| GETAFE | 32 | 19 | 25 |
| GIJÓN | 12 | 8 | 18 |
| GIRONA | 11 | 9 | 18 |
| GRANADA | 14 | 19 | 24 |
| HUESCA | 5 | 6 | 8 |
| LAS PALMAS | 21 | 13 | 23 |
| LEGANÉS | 21 | 18 | 18 |
| LEVANTE | 26 | 24 | 26 |
| MÁLAGA | 30 | 17 | 29 |
| MALLORCA | 0 | 0 | 0 |
| OSASUNA | 2 | 7 | 10 |
| RAYO VALLECANO | 21 | 12 | 24 |
| REAL BETIS | 30 | 22 | 24 |
| REAL MADRID | 71 | 12 | 12 |
| REAL SOCIEDAD | 43 | 24 | 28 |
| SEVILLA FC | 64 | 18 | 13 |
| VALENCIA | 49 | 27 | 19 |
| VALLADOLID | 5 | 5 | 9 |
| VILLARREAL | 51 | 20 | 24 |
# Results of each team as away side (column 1 = defeat, x = draw, 2 = win)
with(train, table(Away_Team, score)) %>%
kbl() %>%
kable_material(c("striped", "hover")) %>%
scroll_box(width = "100%", height = "350px")
| | 1 | x | 2 |
|---|---|---|---|
| ALAVÉS | 29 | 9 | 19 |
| ALMERÍA | 13 | 1 | 5 |
| ATHLETIC | 46 | 21 | 28 |
| ATLETICO MADRID | 21 | 25 | 49 |
| BARCELONA | 11 | 21 | 63 |
| CÁDIZ CF | 0 | 0 | 0 |
| CELTA | 49 | 22 | 24 |
| CÓRDOBA | 12 | 5 | 2 |
| DEPORTIVO | 37 | 30 | 9 |
| EIBAR | 46 | 29 | 20 |
| ELCHE | 9 | 5 | 5 |
| ESPANYOL | 46 | 27 | 22 |
| GETAFE | 38 | 21 | 17 |
| GIJÓN | 22 | 11 | 5 |
| GIRONA | 16 | 10 | 12 |
| GRANADA | 38 | 12 | 7 |
| HUESCA | 11 | 6 | 2 |
| LAS PALMAS | 40 | 11 | 6 |
| LEGANÉS | 35 | 12 | 10 |
| LEVANTE | 45 | 18 | 13 |
| MÁLAGA | 45 | 18 | 13 |
| MALLORCA | 0 | 0 | 0 |
| OSASUNA | 14 | 3 | 2 |
| RAYO VALLECANO | 35 | 11 | 11 |
| REAL BETIS | 40 | 13 | 23 |
| REAL MADRID | 19 | 17 | 59 |
| REAL SOCIEDAD | 45 | 23 | 27 |
| SEVILLA FC | 44 | 23 | 28 |
| VALENCIA | 36 | 25 | 34 |
| VALLADOLID | 8 | 6 | 5 |
| VILLARREAL | 32 | 33 | 30 |
# Percentage of matches won, drawn and lost by each team at home
round(t((prop.table(with(train, table(score, Home_Team)), margin = 2)))*100,2) %>%
kbl() %>%
kable_material(c("striped", "hover")) %>%
scroll_box(width = "100%", height = "350px")
| | 1 | x | 2 |
|---|---|---|---|
| ALAVÉS | 40.35 | 29.82 | 29.82 |
| ALMERÍA | 15.79 | 36.84 | 47.37 |
| ATHLETIC | 49.47 | 31.58 | 18.95 |
| ATLETICO MADRID | 73.68 | 17.89 | 8.42 |
| BARCELONA | 82.11 | 11.58 | 6.32 |
| CÁDIZ CF | NaN | NaN | NaN |
| CELTA | 44.21 | 27.37 | 28.42 |
| CÓRDOBA | 5.26 | 31.58 | 63.16 |
| DEPORTIVO | 26.32 | 32.89 | 40.79 |
| EIBAR | 42.11 | 22.11 | 35.79 |
| ELCHE | 31.58 | 15.79 | 52.63 |
| ESPANYOL | 46.32 | 26.32 | 27.37 |
| GETAFE | 42.11 | 25.00 | 32.89 |
| GIJÓN | 31.58 | 21.05 | 47.37 |
| GIRONA | 28.95 | 23.68 | 47.37 |
| GRANADA | 24.56 | 33.33 | 42.11 |
| HUESCA | 26.32 | 31.58 | 42.11 |
| LAS PALMAS | 36.84 | 22.81 | 40.35 |
| LEGANÉS | 36.84 | 31.58 | 31.58 |
| LEVANTE | 34.21 | 31.58 | 34.21 |
| MÁLAGA | 39.47 | 22.37 | 38.16 |
| MALLORCA | NaN | NaN | NaN |
| OSASUNA | 10.53 | 36.84 | 52.63 |
| RAYO VALLECANO | 36.84 | 21.05 | 42.11 |
| REAL BETIS | 39.47 | 28.95 | 31.58 |
| REAL MADRID | 74.74 | 12.63 | 12.63 |
| REAL SOCIEDAD | 45.26 | 25.26 | 29.47 |
| SEVILLA FC | 67.37 | 18.95 | 13.68 |
| VALENCIA | 51.58 | 28.42 | 20.00 |
| VALLADOLID | 26.32 | 26.32 | 47.37 |
| VILLARREAL | 53.68 | 21.05 | 25.26 |
# Percentage of results of each team as away side (column 1 = defeat, x = draw, 2 = win)
round(t((prop.table(with(train, table(score, Away_Team)), margin = 2)))*100,2) %>%
kbl() %>%
kable_material(c("striped", "hover")) %>%
scroll_box(width = "100%", height = "350px")
| | 1 | x | 2 |
|---|---|---|---|
| ALAVÉS | 50.88 | 15.79 | 33.33 |
| ALMERÍA | 68.42 | 5.26 | 26.32 |
| ATHLETIC | 48.42 | 22.11 | 29.47 |
| ATLETICO MADRID | 22.11 | 26.32 | 51.58 |
| BARCELONA | 11.58 | 22.11 | 66.32 |
| CÁDIZ CF | NaN | NaN | NaN |
| CELTA | 51.58 | 23.16 | 25.26 |
| CÓRDOBA | 63.16 | 26.32 | 10.53 |
| DEPORTIVO | 48.68 | 39.47 | 11.84 |
| EIBAR | 48.42 | 30.53 | 21.05 |
| ELCHE | 47.37 | 26.32 | 26.32 |
| ESPANYOL | 48.42 | 28.42 | 23.16 |
| GETAFE | 50.00 | 27.63 | 22.37 |
| GIJÓN | 57.89 | 28.95 | 13.16 |
| GIRONA | 42.11 | 26.32 | 31.58 |
| GRANADA | 66.67 | 21.05 | 12.28 |
| HUESCA | 57.89 | 31.58 | 10.53 |
| LAS PALMAS | 70.18 | 19.30 | 10.53 |
| LEGANÉS | 61.40 | 21.05 | 17.54 |
| LEVANTE | 59.21 | 23.68 | 17.11 |
| MÁLAGA | 59.21 | 23.68 | 17.11 |
| MALLORCA | NaN | NaN | NaN |
| OSASUNA | 73.68 | 15.79 | 10.53 |
| RAYO VALLECANO | 61.40 | 19.30 | 19.30 |
| REAL BETIS | 52.63 | 17.11 | 30.26 |
| REAL MADRID | 20.00 | 17.89 | 62.11 |
| REAL SOCIEDAD | 47.37 | 24.21 | 28.42 |
| SEVILLA FC | 46.32 | 24.21 | 29.47 |
| VALENCIA | 37.89 | 26.32 | 35.79 |
| VALLADOLID | 42.11 | 31.58 | 26.32 |
| VILLARREAL | 33.68 | 34.74 | 31.58 |
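The same row percentages can be obtained without the transpose by tabulating team by result and normalising over rows (margin = 1). The sketch below uses made-up data and base R; it also shows that dropping unused factor levels first avoids NaN rows like those of CÁDIZ CF and MALLORCA, which come from dividing 0 by 0.

```r
# Made-up results; team "C" has no matches in this subset
team  <- factor(c("A", "A", "A", "A", "B", "B"), levels = c("A", "B", "C"))
score <- factor(c("1", "1", "x", "2", "2", "2"), levels = c("1", "x", "2"))

# With the unused level kept, the row for "C" is 0/0 = NaN
pct_all <- round(prop.table(table(team, score), margin = 1) * 100, 2)
pct_all

# Dropping unused levels removes that row
pct <- round(prop.table(table(droplevels(team), score), margin = 1) * 100, 2)
pct
```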
These data are also plotted for a more intuitive reading:
ggplot(train, aes(x = Home_Team, color = score, fill = score)) +
  geom_bar() +
  facet_wrap(~score) +
  coord_flip() +
  labs(x = "Equipo", y = "Partidos", title = "Nº de victorias, empates y derrotas como equipo local") +
  theme(plot.title = element_text(hjust = 0.5))
# For away teams, facet 1 holds defeats and facet 2 holds wins
ggplot(train, aes(x = Away_Team, color = score, fill = score)) +
  geom_bar() +
  facet_wrap(~score) +
  coord_flip() +
  labs(x = "Equipo", y = "Partidos", title = "Nº de derrotas, empates y victorias como equipo visitante") +
  theme(plot.title = element_text(hjust = 0.5))
It is interesting that, after analysing the qualitative variables alone, the league's leading teams can already be identified.
Not only have 11 teams stayed in the first division every year (Athletic, Atlético, Barcelona, Celta, Eibar, Espanyol, Real Madrid, Real Sociedad, Sevilla, Valencia and Villarreal), but some of them also achieve a higher proportion of wins.
Within this group the three current "big" clubs (Real Madrid, Barcelona and Atlético) stand out: although at home some other team (such as Sevilla) comes close to them in wins, it is in away matches that these three rise above the rest (especially Madrid and Barcelona).
This is especially relevant when compared with the earlier figures, which showed that playing at home is a major factor in winning a match.
A team's proportion of away wins is therefore a good indication of how strong it is.
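As an illustration of this indicator, the sketch below ranks a few teams by away win percentage. The counts are copied from the away table above (column 1 = defeats, x = draws, 2 = wins); the data frame away and its column names are made up for the example, and only five teams are included.

```r
# Away records copied from the table of results by away team
away <- data.frame(
  team   = c("BARCELONA", "REAL MADRID", "ATLETICO MADRID", "SEVILLA FC", "VALENCIA"),
  losses = c(11, 19, 21, 44, 36),
  draws  = c(21, 17, 25, 23, 25),
  wins   = c(63, 59, 49, 28, 34)
)

# Away win percentage over all away matches played
away$win_pct <- round(away$wins / (away$losses + away$draws + away$wins) * 100, 2)

# Teams ranked from highest to lowest away win percentage
ranked <- away[order(-away$win_pct), c("team", "win_pct")]
ranked
```

Sevilla, close to the top three at home, falls well behind them on this measure.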
For the quantitative variables there are a great many to analyse, so for clarity we group them by similar match events and compare home and away teams, both numerically and visually with box plots.
# Possession
bind_rows(data.frame(summarise(train,
max = max(`Home_Team_Possession_%`),
min = min(`Home_Team_Possession_%`),
mean = mean(`Home_Team_Possession_%`),
sd = sd(`Home_Team_Possession_%`))),
data.frame(summarise(train,
max = max(`Away_Team_Possession_%`),
min = min(`Away_Team_Possession_%`),
mean = mean(`Away_Team_Possession_%`),
sd = sd(`Away_Team_Possession_%`)))
) %>% mutate("Posesión (%)" = c("Local", "Visitante")) %>%
select("Posesión (%)", everything()) %>%
kbl() %>%
kable_material(c("striped", "hover"))
| Posesión (%) | max | min | mean | sd |
|---|---|---|---|---|
| Local | 82 | 19 | 51.73526 | 10.81677 |
| Visitante | 81 | 18 | 48.25789 | 10.82105 |
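Since the same summary is repeated for every pair of variables, the pattern could be wrapped in a small helper. The function compare_home_away below is hypothetical (it does not appear in the original code), uses base R only, and is shown with made-up possession values rather than the real columns.

```r
# Hypothetical helper: max, min, mean and sd for a home/away pair of variables
compare_home_away <- function(home, away, label = "Variable") {
  stats <- function(x) c(max = max(x), min = min(x), mean = mean(x), sd = sd(x))
  out <- data.frame(c("Local", "Visitante"), rbind(stats(home), stats(away)))
  names(out)[1] <- label   # first column identifies the side
  out
}

# Made-up possession values for four matches (home and away add up to 100)
home_poss <- c(40, 47, 53, 72)
away_poss <- 100 - home_poss
compare_home_away(home_poss, away_poss, label = "Possession (%)")
```

With the real data the call would take two columns of train, for example the total shots of each side, and the result could be piped into kbl() as in the rest of the section.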
# Pass success
bind_rows(data.frame(summarise(train,
max = max(`Home_Team_Pass_Success_%`),
min = min(`Home_Team_Pass_Success_%`),
mean = mean(`Home_Team_Pass_Success_%`),
sd = sd(`Home_Team_Pass_Success_%`))),
data.frame(summarise(train,
max = max(`Away_Team_Pass_Success_%`),
min = min(`Away_Team_Pass_Success_%`),
mean = mean(`Away_Team_Pass_Success_%`),
sd = sd(`Away_Team_Pass_Success_%`)))
) %>% mutate("Pases acertados (%)" = c("Local", "Visitante")) %>%
select("Pases acertados (%)", everything()) %>%
kbl() %>%
kable_material(c("striped", "hover"))
| Pases acertados (%) | max | min | mean | sd |
|---|---|---|---|---|
| Local | 93 | 46 | 77.68316 | 7.486947 |
| Visitante | 93 | 41 | 75.73737 | 7.947165 |
par(mfrow = c(1, 2))
boxplot(train$`Home_Team_Possession_%`, train$`Away_Team_Possession_%`, main = 'Posesión (%)',
ylab = "Valores", col = c("#d55566","#066699"))
axis(1, at = 1:2, labels = c("Local", "Visitante"))
boxplot(train$`Home_Team_Pass_Success_%`, train$`Away_Team_Pass_Success_%`, main = 'Pases acertados (%)',
ylab = "Valores", col = c("#d55566","#066699"))
axis(1, at = 1:2, labels = c("Local", "Visitante"))
In terms of possession and pass success rate (two logically related variables), both distributions are almost identical, with slightly higher values for the home team but no large differences.
# Shots
bind_rows(data.frame(summarise(train,
max = max(Home_Team_Total_Shots),
min = min(Home_Team_Total_Shots),
mean = mean(Home_Team_Total_Shots),
sd = sd(Home_Team_Total_Shots))),
data.frame(summarise(train,
max = max(Away_Team_Total_Shots),
min = min(Away_Team_Total_Shots),
mean = mean(Away_Team_Total_Shots),
sd = sd(Away_Team_Total_Shots)))
) %>% mutate("Tiros" = c("Local", "Visitante")) %>%
select("Tiros", everything()) %>%
kbl() %>%
kable_material(c("striped", "hover"))
| Tiros | max | min | mean | sd |
|---|---|---|---|---|
| Local | 33 | 2 | 13.53421 | 4.760000 |
| Visitante | 28 | 0 | 10.49158 | 4.296837 |
# Shots on target
bind_rows(data.frame(summarise(train,
max = max(Home_Team_On_Target_Shots),
min = min(Home_Team_On_Target_Shots),
mean = mean(Home_Team_On_Target_Shots),
sd = sd(Home_Team_On_Target_Shots))),
data.frame(summarise(train,
max = max(Away_Team_On_Target_Shots),
min = min(Away_Team_On_Target_Shots),
mean = mean(Away_Team_On_Target_Shots),
sd = sd(Away_Team_On_Target_Shots)))
) %>% mutate("Tiros a puerta" = c("Local", "Visitante")) %>%
select("Tiros a puerta", everything()) %>%
kbl() %>%
kable_material(c("striped", "hover"))
| Tiros a puerta | max | min | mean | sd |
|---|---|---|---|---|
| Local | 15 | 0 | 4.841053 | 2.536931 |
| Visitante | 13 | 0 | 3.747895 | 2.185722 |
# Shots off target
bind_rows(data.frame(summarise(train,
max = max(Home_Team_Off_Target_Shots),
min = min(Home_Team_Off_Target_Shots),
mean = mean(Home_Team_Off_Target_Shots),
sd = sd(Home_Team_Off_Target_Shots))),
data.frame(summarise(train,
max = max(Away_Team_Off_Target_Shots),
min = min(Away_Team_Off_Target_Shots),
mean = mean(Away_Team_Off_Target_Shots),
sd = sd(Away_Team_Off_Target_Shots)))
) %>% mutate("Tiros fuera" = c("Local", "Visitante")) %>%
select("Tiros fuera", everything()) %>%
kbl() %>%
kable_material(c("striped", "hover"))
| Tiros fuera | max | min | mean | sd |
|---|---|---|---|---|
| Local | 17 | 0 | 5.721579 | 2.669599 |
| Visitante | 14 | 0 | 4.352632 | 2.339952 |
# Blocked shots
bind_rows(data.frame(summarise(train,
max = max(Home_Team_Blocked_Shots),
min = min(Home_Team_Blocked_Shots),
mean = mean(Home_Team_Blocked_Shots),
sd = sd(Home_Team_Blocked_Shots))),
data.frame(summarise(train,
max = max(Away_Team_Blocked_Shots),
min = min(Away_Team_Blocked_Shots),
mean = mean(Away_Team_Blocked_Shots),
sd = sd(Away_Team_Blocked_Shots)))
) %>% mutate("Tiros bloqueados" = c("Local", "Visitante")) %>%
select("Tiros bloqueados", everything()) %>%
kbl() %>%
kable_material(c("striped", "hover"))
| Tiros bloqueados | max | min | mean | sd |
|---|---|---|---|---|
| Local | 12 | 0 | 2.981579 | 1.991604 |
| Visitante | 11 | 0 | 2.410526 | 1.837212 |
par(mfrow = c(1, 4))
boxplot(train$Home_Team_Total_Shots, train$Away_Team_Total_Shots, main = 'Tiros',
ylab = "Valores", col = c("#d55566","#066699"))
axis(1, at = 1:2, labels = c("Local", "Visitante"))
boxplot(train$Home_Team_On_Target_Shots, train$Away_Team_On_Target_Shots, main = 'Tiros a puerta',
ylab = "Valores", col = c("#d55566","#066699"))
axis(1, at = 1:2, labels = c("Local", "Visitante"))
boxplot(train$Home_Team_Off_Target_Shots, train$Away_Team_Off_Target_Shots, main = 'Tiros fuera',
ylab = "Valores", col = c("#d55566","#066699"))
axis(1, at = 1:2, labels = c("Local", "Visitante"))
boxplot(train$Home_Team_Blocked_Shots, train$Away_Team_Blocked_Shots, main = 'Tiros bloqueados',
ylab = "Valores", col = c("#d55566","#066699"))
axis(1, at = 1:2, labels = c("Local", "Visitante"))
In the shots category the home team does lead in every facet. The differences are not huge, but they seem large enough to matter, as the scoreline reflects.
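The summary tables in this section all follow the same pattern (max, min, mean and sd for the home and away version of one statistic). As a minimal sketch, that pattern could be wrapped in a helper. The function name `home_away_summary`, the `equipo` column and the toy data frame are assumptions made for illustration; they are not part of the original analysis, which builds each table explicitly.

```r
# Hypothetical helper: builds the max/min/mean/sd table for one statistic,
# given the suffix shared by its Home_Team_* and Away_Team_* columns.
home_away_summary <- function(data, stat) {
  cols <- c(Local = paste0("Home_Team_", stat),
            Visitante = paste0("Away_Team_", stat))
  rows <- lapply(names(cols), function(side) {
    x <- data[[cols[[side]]]]
    data.frame(equipo = side, max = max(x), min = min(x),
               mean = mean(x), sd = sd(x))
  })
  do.call(rbind, rows)
}

# Toy data standing in for `train` (values are made up)
toy <- data.frame(Home_Team_Corners = c(4, 6, 8),
                  Away_Team_Corners = c(2, 3, 7))
res <- home_away_summary(toy, "Corners")
res
```

With the real data, the result of `home_away_summary(train, "Corners")` could be piped into `kbl()` in the same way as the tables above.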
# Corners
bind_rows(data.frame(summarise(train,
max = max(Home_Team_Corners),
min = min(Home_Team_Corners),
mean = mean(Home_Team_Corners),
sd = sd(Home_Team_Corners))),
data.frame(summarise(train,
max = max(Away_Team_Corners),
min = min(Away_Team_Corners),
mean = mean(Away_Team_Corners),
sd = sd(Away_Team_Corners)))
) %>% mutate("Corners" = c("Local", "Visitante")) %>%
select("Corners", everything()) %>%
kbl() %>%
kable_material(c("striped", "hover"))
| Corners | max | min | mean | sd |
|---|---|---|---|---|
| Local | 20 | 0 | 5.657895 | 2.860605 |
| Visitante | 32 | 0 | 4.202632 | 2.523923 |
# Throw-ins
bind_rows(data.frame(summarise(train,
max = max(Home_Team_Throw_Ins),
min = min(Home_Team_Throw_Ins),
mean = mean(Home_Team_Throw_Ins),
sd = sd(Home_Team_Throw_Ins))),
data.frame(summarise(train,
max = max(Away_Team_Throw_Ins),
min = min(Away_Team_Throw_Ins),
mean = mean(Away_Team_Throw_Ins),
sd = sd(Away_Team_Throw_Ins)))
) %>% mutate("Saques de banda" = c("Local", "Visitante")) %>%
select("Saques de banda", everything()) %>%
kbl() %>%
kable_material(c("striped", "hover"))
| Saques de banda | max | min | mean | sd |
|---|---|---|---|---|
| Local | 47 | 5 | 22.84368 | 6.795616 |
| Visitante | 45 | 3 | 21.61526 | 6.477398 |
# Aerial duels won
bind_rows(data.frame(summarise(train,
max = max(Home_Team_Aerials_Won),
min = min(Home_Team_Aerials_Won),
mean = mean(Home_Team_Aerials_Won),
sd = sd(Home_Team_Aerials_Won))),
data.frame(summarise(train,
max = max(Away_Team_Aerials_Won),
min = min(Away_Team_Aerials_Won),
mean = mean(Away_Team_Aerials_Won),
sd = sd(Away_Team_Aerials_Won)))
) %>% mutate("Balones aéreos ganados" = c("Local", "Visitante")) %>%
select("Balones aéreos ganados", everything()) %>%
kbl() %>%
kable_material(c("striped", "hover"))
| Balones aéreos ganados | max | min | mean | sd |
|---|---|---|---|---|
| Local | 46 | 2 | 16.61579 | 7.241691 |
| Visitante | 47 | 0 | 16.15526 | 6.954876 |
# Clearances
bind_rows(data.frame(summarise(train,
max = max(Home_Team_Clearances),
min = min(Home_Team_Clearances),
mean = mean(Home_Team_Clearances),
sd = sd(Home_Team_Clearances))),
data.frame(summarise(train,
max = max(Away_Team_Clearances),
min = min(Away_Team_Clearances),
mean = mean(Away_Team_Clearances),
sd = sd(Away_Team_Clearances)))
) %>% mutate("Despejes" = c("Local", "Visitante")) %>%
select("Despejes", everything()) %>%
kbl() %>%
kable_material(c("striped", "hover"))
| Despejes | max | min | mean | sd |
|---|---|---|---|---|
| Local | 61 | 1 | 18.42684 | 8.404368 |
| Visitante | 63 | 2 | 22.92632 | 9.473326 |
par(mfrow = c(1, 4))
boxplot(train$Home_Team_Corners, train$Away_Team_Corners, main = 'Corners',
ylab = "Valores", col = c("#d55566","#066699"))
axis(1, at = 1:2, labels = c("Local", "Visitante"))
boxplot(train$Home_Team_Throw_Ins, train$Away_Team_Throw_Ins, main = 'Saques de banda',
ylab = "Valores", col = c("#d55566","#066699"))
axis(1, at = 1:2, labels = c("Local", "Visitante"))
boxplot(train$Home_Team_Aerials_Won, train$Away_Team_Aerials_Won, main = 'Balones aéreos ganados',
ylab = "Valores", col = c("#d55566","#066699"))
axis(1, at = 1:2, labels = c("Local", "Visitante"))
boxplot(train$Home_Team_Clearances, train$Away_Team_Clearances, main = 'Despejes',
ylab = "Valores", col = c("#d55566","#066699"))
axis(1, at = 1:2, labels = c("Local", "Visitante"))
As with shots, the home team also leads in corners and throw-ins (the `Throw_Ins` variable), which suggests it spends more time in the opposition's half. This is consistent with the away team being the one that makes more clearances.
# Fouls
bind_rows(data.frame(summarise(train,
max = max(Home_Team_Fouls),
min = min(Home_Team_Fouls),
mean = mean(Home_Team_Fouls),
sd = sd(Home_Team_Fouls))),
data.frame(summarise(train,
max = max(Away_Team_Fouls),
min = min(Away_Team_Fouls),
mean = mean(Away_Team_Fouls),
sd = sd(Away_Team_Fouls)))
) %>% mutate("Faltas" = c("Local", "Visitante")) %>%
select("Faltas", everything()) %>%
kbl() %>%
kable_material(c("striped", "hover"))
| Faltas | max | min | mean | sd |
|---|---|---|---|---|
| Local | 33 | 1 | 13.84684 | 4.254319 |
| Visitante | 29 | 0 | 13.87526 | 4.187637 |
# Yellow cards
bind_rows(data.frame(summarise(train,
max = max(Home_Team_Yellow_Cards),
min = min(Home_Team_Yellow_Cards),
mean = mean(Home_Team_Yellow_Cards),
sd = sd(Home_Team_Yellow_Cards))),
data.frame(summarise(train,
max = max(Away_Team_Yellow_Cards),
min = min(Away_Team_Yellow_Cards),
mean = mean(Away_Team_Yellow_Cards),
sd = sd(Away_Team_Yellow_Cards)))
) %>% mutate("Tarjetas amarillas" = c("Local", "Visitante")) %>%
select("Tarjetas amarillas", everything()) %>%
kbl() %>%
kable_material(c("striped", "hover"))
| Tarjetas amarillas | max | min | mean | sd |
|---|---|---|---|---|
| Local | 8 | 0 | 2.460526 | 1.521833 |
| Visitante | 8 | 0 | 2.744210 | 1.485703 |
# Second yellow cards
bind_rows(data.frame(summarise(train,
max = max(Home_Team_Second_Yellow_Cards),
min = min(Home_Team_Second_Yellow_Cards),
mean = mean(Home_Team_Second_Yellow_Cards),
sd = sd(Home_Team_Second_Yellow_Cards))),
data.frame(summarise(train,
max = max(Away_Team_Second_Yellow_Cards),
min = min(Away_Team_Second_Yellow_Cards),
mean = mean(Away_Team_Second_Yellow_Cards),
sd = sd(Away_Team_Second_Yellow_Cards)))
) %>% mutate("Segundas tarjetas amarillas" = c("Local", "Visitante")) %>%
select("Segundas tarjetas amarillas", everything()) %>%
kbl() %>%
kable_material(c("striped", "hover"))
| Segundas tarjetas amarillas | max | min | mean | sd |
|---|---|---|---|---|
| Local | 2 | 0 | 0.0600000 | 0.2462569 |
| Visitante | 2 | 0 | 0.0836842 | 0.2863345 |
# Red cards
bind_rows(data.frame(summarise(train,
max = max(Home_Team_Red_Cards),
min = min(Home_Team_Red_Cards),
mean = mean(Home_Team_Red_Cards),
sd = sd(Home_Team_Red_Cards))),
data.frame(summarise(train,
max = max(Away_Team_Red_Cards),
min = min(Away_Team_Red_Cards),
mean = mean(Away_Team_Red_Cards),
sd = sd(Away_Team_Red_Cards)))
) %>% mutate("Tarjetas rojas" = c("Local", "Visitante")) %>%
select("Tarjetas rojas", everything()) %>%
kbl() %>%
kable_material(c("striped", "hover"))
| Tarjetas rojas | max | min | mean | sd |
|---|---|---|---|---|
| Local | 2 | 0 | 0.0442105 | 0.2131614 |
| Visitante | 2 | 0 | 0.0478947 | 0.2255898 |
par(mfrow = c(1, 2))
boxplot(train$Home_Team_Fouls, train$Away_Team_Fouls, main = 'Faltas',
ylab = "Valores", col = c("#d55566","#066699"))
axis(1, at = 1:2, labels = c("Local", "Visitante"))
boxplot(train$Home_Team_Yellow_Cards, train$Away_Team_Yellow_Cards, main = 'Tarjetas amarillas',
ylab = "Valores", col = c("#d55566","#066699"))
axis(1, at = 1:2, labels = c("Local", "Visitante"))
Fouls are distributed almost identically for both sides (the away mean is in fact marginally higher, 13.88 versus 13.85 per match), yet the away team receives noticeably more yellow cards (2.74 versus 2.46). A likely explanation, given that the home team creates more play in the opposition's half, is that away fouls tend to cut short potentially more dangerous attacks.
Other factors that are harder to measure probably also play a part, such as the pressure the home crowd puts on the referee's decisions.
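To put the fouls-versus-cards comparison on a common scale, the means reported in the tables above can be turned into yellow cards per foul. This is only a back-of-the-envelope sketch: it divides two per-match averages, so it is a rough rate and not a match-by-match analysis. The variable names are chosen here for illustration.

```r
# Per-match means copied from the fouls and yellow-card tables above
fouls   <- c(Local = 13.84684, Visitante = 13.87526)
yellows <- c(Local = 2.460526, Visitante = 2.744210)

# Average yellow cards shown per foul committed
cards_per_foul <- yellows / fouls
round(cards_per_foul, 3)
# Local 0.178, Visitante 0.198
```

On these figures the away side is booked slightly more often per foul, which is the gap the paragraph above tries to explain.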
# Goals
bind_rows(data.frame(summarise(train,
max = max(Home_Team_Goals_Scored),
min = min(Home_Team_Goals_Scored),
mean = mean(Home_Team_Goals_Scored),
sd = sd(Home_Team_Goals_Scored))),
data.frame(summarise(train,
max = max(Away_Team_Goals_Scored),
min = min(Away_Team_Goals_Scored),
mean = mean(Away_Team_Goals_Scored),
sd = sd(Away_Team_Goals_Scored)))
) %>% mutate("Goles" = c("Local", "Visitante")) %>%
select("Goles", everything()) %>%
kbl() %>%
kable_material(c("striped", "hover"))
| Goles | max | min | mean | sd |
|---|---|---|---|---|
| Local | 10 | 0 | 1.563684 | 1.364813 |
| Visitante | 8 | 0 | 1.161053 | 1.180123 |
par(mfrow = c(1, 2))
boxplot(train$Home_Team_Goals_Scored, train$Away_Team_Goals_Scored, main = 'Goles',
ylab = "Valores", col = c("#d55566","#066699"))
axis(1, at = 1:2, labels = c("Local", "Visitante"))
Finally, home teams score more goals than away teams, which is to be expected given that home wins are the most common result.
To close the univariate analysis, the histograms below show each variable with its density curve overlaid. At a glance, most of them look roughly bell-shaped, close to a normal distribution.
# after_stat(density) replaces the ..density.. notation, deprecated in ggplot2 3.4.0
plot_grid(ggplot(train, aes(`Home_Team_Possession_%`)) +
geom_histogram(aes(y = after_stat(density)), bins = 20, position = "dodge", color = 'black', fill = '#d55566') +
geom_density(alpha=.3, fill = "lightgreen") + labs(x="Posesión (%)", y="Densidad") + theme(plot.title=element_text(hjust = 0.5)),
ggplot(train, aes(`Away_Team_Possession_%`)) +
geom_histogram(aes(y = after_stat(density)), bins = 20, position = "dodge", color = 'black', fill = '#066699') +
geom_density(alpha=.3, fill = "lightgreen") + labs(x="Posesión (%)", y="Densidad") + theme(plot.title=element_text(hjust = 0.5)),
ggplot(train, aes(`Home_Team_Pass_Success_%`)) +
geom_histogram(aes(y = after_stat(density)), bins = 20, position = "dodge", color = 'black', fill = '#d55566') +
geom_density(alpha=.3, fill = "lightgreen") + labs(x="Pases acertados (%)", y="Densidad") + theme(plot.title=element_text(hjust = 0.5)),
ggplot(train, aes(`Away_Team_Pass_Success_%`)) +
geom_histogram(aes(y = after_stat(density)), bins = 20, position = "dodge", color = 'black', fill = '#066699') +
geom_density(alpha=.3, fill = "lightgreen") + labs(x="Pases acertados (%)", y="Densidad") + theme(plot.title=element_text(hjust = 0.5)),
align = "h", ncol = 2, nrow = 2)
# Shots: histograms on the density scale (after_stat(density) instead of the deprecated ..density..)
plot_grid(ggplot(train, aes(Home_Team_Total_Shots)) +
geom_histogram(aes(y = after_stat(density)), bins = 10, position = "dodge", color = 'black', fill = '#d55566') +
geom_density(alpha=.3, fill = "lightgreen") + labs(x="Tiros", y="Densidad") + theme(plot.title=element_text(hjust = 0.5)),
ggplot(train, aes(Away_Team_Total_Shots)) +
geom_histogram(aes(y = after_stat(density)), bins = 10, position = "dodge", color = 'black', fill = '#066699') +
geom_density(alpha=.3, fill = "lightgreen") + labs(x="Tiros", y="Densidad") + theme(plot.title=element_text(hjust = 0.5)),
ggplot(train, aes(Home_Team_On_Target_Shots)) +
geom_histogram(aes(y = after_stat(density)), bins = 10, position = "dodge", color = 'black', fill = '#d55566') +
geom_density(alpha=.3, fill = "lightgreen") + labs(x="Tiros a puerta", y="Densidad") + theme(plot.title=element_text(hjust = 0.5)),
ggplot(train, aes(Away_Team_On_Target_Shots)) +
geom_histogram(aes(y = after_stat(density)), bins = 10, position = "dodge", color = 'black', fill = '#066699') +
geom_density(alpha=.3, fill = "lightgreen") + labs(x="Tiros a puerta", y="Densidad") + theme(plot.title=element_text(hjust = 0.5)),
ggplot(train, aes(Home_Team_Off_Target_Shots)) +
geom_histogram(aes(y = after_stat(density)), bins = 10, position = "dodge", color = 'black', fill = '#d55566') +
geom_density(alpha=.3, fill = "lightgreen") + labs(x="Tiros fuera", y="Densidad") + theme(plot.title=element_text(hjust = 0.5)),
ggplot(train, aes(Away_Team_Off_Target_Shots)) +
geom_histogram(aes(y = after_stat(density)), bins = 10, position = "dodge", color = 'black', fill = '#066699') +
geom_density(alpha=.3, fill = "lightgreen") + labs(x="Tiros fuera", y="Densidad") + theme(plot.title=element_text(hjust = 0.5)),
ggplot(train, aes(Home_Team_Blocked_Shots)) +
geom_histogram(aes(y = after_stat(density)), bins = 10, position = "dodge", color = 'black', fill = '#d55566') +
geom_density(alpha=.3, fill = "lightgreen") + labs(x="Tiros bloqueados", y="Densidad") + theme(plot.title=element_text(hjust = 0.5)),
ggplot(train, aes(Away_Team_Blocked_Shots)) +
geom_histogram(aes(y = after_stat(density)), bins = 10, position = "dodge", color = 'black', fill = '#066699') +
geom_density(alpha=.3, fill = "lightgreen") + labs(x="Tiros bloqueados", y="Densidad") + theme(plot.title=element_text(hjust = 0.5)),
align = "h", nrow = 2, ncol = 4)
# Corners, throw-ins, aerial duels won and clearances (density scale via after_stat(density))
plot_grid(ggplot(train, aes(Home_Team_Corners)) +
geom_histogram(aes(y = after_stat(density)), bins = 15, position = "dodge", color = 'black', fill = '#d55566') +
geom_density(alpha=.3, fill = "lightgreen") + labs(x="Corners", y="Densidad") + theme(plot.title=element_text(hjust = 0.5)),
ggplot(train, aes(Away_Team_Corners)) +
geom_histogram(aes(y = after_stat(density)), bins = 15, position = "dodge", color = 'black', fill = '#066699') +
geom_density(alpha=.3, fill = "lightgreen") + labs(x="Corners", y="Densidad") + theme(plot.title=element_text(hjust = 0.5)),
ggplot(train, aes(Home_Team_Throw_Ins)) +
geom_histogram(aes(y = after_stat(density)), bins = 15, position = "dodge", color = 'black', fill = '#d55566') +
geom_density(alpha=.3, fill = "lightgreen") + labs(x="Saques de banda", y="Densidad") + theme(plot.title=element_text(hjust = 0.5)),
ggplot(train, aes(Away_Team_Throw_Ins)) +
geom_histogram(aes(y = after_stat(density)), bins = 15, position = "dodge", color = 'black', fill = '#066699') +
geom_density(alpha=.3, fill = "lightgreen") + labs(x="Saques de banda", y="Densidad") + theme(plot.title=element_text(hjust = 0.5)),
ggplot(train, aes(Home_Team_Aerials_Won)) +
geom_histogram(aes(y = after_stat(density)), bins = 15, position = "dodge", color = 'black', fill = '#d55566') +
geom_density(alpha=.3, fill = "lightgreen") + labs(x="Balones aéreos ganados", y="Densidad") + theme(plot.title=element_text(hjust = 0.5)),
ggplot(train, aes(Away_Team_Aerials_Won)) +
geom_histogram(aes(y = after_stat(density)), bins = 15, position = "dodge", color = 'black', fill = '#066699') +
geom_density(alpha=.3, fill = "lightgreen") + labs(x="Balones aéreos ganados", y="Densidad") + theme(plot.title=element_text(hjust = 0.5)),
ggplot(train, aes(Home_Team_Clearances)) +
geom_histogram(aes(y = after_stat(density)), bins = 15, position = "dodge", color = 'black', fill = '#d55566') +
geom_density(alpha=.3, fill = "lightgreen") + labs(x="Despejes", y="Densidad") + theme(plot.title=element_text(hjust = 0.5)),
ggplot(train, aes(Away_Team_Clearances)) +
geom_histogram(aes(y = after_stat(density)), bins = 15, position = "dodge", color = 'black', fill = '#066699') +
geom_density(alpha=.3, fill = "lightgreen") + labs(x="Despejes", y="Densidad") + theme(plot.title=element_text(hjust = 0.5)),
align = "h", nrow = 2, ncol = 4)
# Fouls, yellow cards and goals. The goals histograms are also put on the
# density scale so that they are comparable with the overlaid density curve.
plot_grid(ggplot(train, aes(Home_Team_Fouls)) +
geom_histogram(aes(y = after_stat(density)), bins = 15, position = "dodge", color = 'black', fill = '#d55566') +
geom_density(alpha=.3, fill = "lightgreen") + labs(x="Faltas", y="Densidad") + theme(plot.title=element_text(hjust = 0.5)),
ggplot(train, aes(Away_Team_Fouls)) +
geom_histogram(aes(y = after_stat(density)), bins = 15, position = "dodge", color = 'black', fill = '#066699') +
geom_density(alpha=.3, fill = "lightgreen") + labs(x="Faltas", y="Densidad") + theme(plot.title=element_text(hjust = 0.5)),
ggplot(train, aes(Home_Team_Yellow_Cards)) +
geom_histogram(aes(y = after_stat(density)), bins = 5, position = "dodge", color = 'black', fill = '#d55566') +
geom_density(alpha=.3, fill = "lightgreen") + labs(x="Tarjetas amarillas", y="Densidad") + theme(plot.title=element_text(hjust = 0.5)),
ggplot(train, aes(Away_Team_Yellow_Cards)) +
geom_histogram(aes(y = after_stat(density)), bins = 5, position = "dodge", color = 'black', fill = '#066699') +
geom_density(alpha=.3, fill = "lightgreen") + labs(x="Tarjetas amarillas", y="Densidad") + theme(plot.title=element_text(hjust = 0.5)),
ggplot(train, aes(Home_Team_Goals_Scored)) +
geom_histogram(aes(y = after_stat(density)), bins = 10, position = "dodge", color = 'black', fill = '#d55566') +
geom_density(alpha=.3, fill = "lightgreen") + labs(x="Goles", y="Densidad") + theme(plot.title=element_text(hjust = 0.5)),
ggplot(train, aes(Away_Team_Goals_Scored)) +
geom_histogram(aes(y = after_stat(density)), bins = 10, position = "dodge", color = 'black', fill = '#066699') +
geom_density(alpha=.3, fill = "lightgreen") + labs(x="Goles", y="Densidad") + theme(plot.title=element_text(hjust = 0.5)),
align = "h", nrow = 2, ncol = 4)
For goals, however, normality is much less evident: the histogram is right-skewed and looks closer to an exponential. To check this we use a Q-Q plot.
# Q-Q plot to check normality
qqplot(train$Home_Team_Goals_Scored, qnorm(ppoints(length(train$Home_Team_Goals_Scored))))
As suspected, the distribution is not normal (if it were, the points in the Q-Q plot would lie roughly on a straight line).
Next we compare it against an exponential distribution, which in this case is a closer match.
# Q-Q plot to check whether it is exponential
qqplot(train$Home_Team_Goals_Scored, qexp(ppoints(length(train$Home_Team_Goals_Scored))))
The remaining variables do look approximately normal, as the Q-Q plot for possession illustrates (picked simply as an example):
qqplot(train$`Home_Team_Possession_%`, qnorm(ppoints(length(train$`Home_Team_Possession_%`))))
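The "roughly a straight line" criterion used in these Q-Q plots can also be expressed as a number: the correlation between the ordered sample and the theoretical normal quantiles. The sketch below uses simulated samples rather than the match data, and the helper name `qq_cor` and the seed are assumptions made for illustration.

```r
set.seed(42)  # arbitrary seed, only for reproducibility
n <- 500
normal_sample <- rnorm(n, mean = 50, sd = 10)  # simulated, bell-shaped
skewed_sample <- rexp(n, rate = 1)             # simulated, right-skewed

# Correlation between the ordered sample and the theoretical normal quantiles:
# values close to 1 correspond to points lying near a straight line.
qq_cor <- function(x) cor(sort(x), qnorm(ppoints(length(x))))

r_normal <- qq_cor(normal_sample)
r_skewed <- qq_cor(skewed_sample)
c(normal = r_normal, skewed = r_skewed)
```

The skewed sample should score visibly lower than the normal one. For plots, base R's `qqnorm()` followed by `qqline()` draws the same comparison with a reference line added.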
Having analysed every variable individually, we now look for relationships between them, first by crossing them all in the following plots and then by plotting the most interesting pairs.
Our target variables, to which we pay special attention, are the result and the goals scored by each team, since they determine the outcome of a match.
num_data <- train %>% select("Home_Team_Goals_Scored", "Home_Team_Possession_%", "Home_Team_Pass_Success_%", "Home_Team_Total_Shots", "Home_Team_On_Target_Shots", "Home_Team_Off_Target_Shots", "Home_Team_Blocked_Shots", "Home_Team_Corners", "Home_Team_Throw_Ins", "Home_Team_Aerials_Won", "Home_Team_Clearances", "Home_Team_Fouls", "Home_Team_Yellow_Cards", "Home_Team_Second_Yellow_Cards", "Home_Team_Red_Cards", "Away_Team_Goals_Scored", "Away_Team_Possession_%", "Away_Team_Pass_Success_%", "Away_Team_Total_Shots", "Away_Team_On_Target_Shots", "Away_Team_Off_Target_Shots", "Away_Team_Blocked_Shots", "Away_Team_Corners", "Away_Team_Throw_Ins", "Away_Team_Aerials_Won", "Away_Team_Clearances", "Away_Team_Fouls", "Away_Team_Yellow_Cards", "Away_Team_Second_Yellow_Cards", "Away_Team_Red_Cards", "year")
ggcorrplot(cor(num_data),type = "lower", lab=FALSE)
num_data <- train %>% select("Home_Team_Goals_Scored", "Home_Team_Possession_%", "Home_Team_Pass_Success_%", "Home_Team_Total_Shots", "Home_Team_On_Target_Shots", "Home_Team_Off_Target_Shots", "Home_Team_Blocked_Shots", "Home_Team_Corners", "Home_Team_Throw_Ins", "Home_Team_Aerials_Won", "Home_Team_Clearances", "Home_Team_Fouls", "Home_Team_Yellow_Cards", "Home_Team_Second_Yellow_Cards", "Home_Team_Red_Cards", "year")
ggcorrplot(cor(num_data),type = "lower", lab=TRUE)
Apart from the obvious perfect negative correlation between the two teams' possession (the two percentages add up to 100), higher possession, which goes hand in hand with pass accuracy, is also associated with more attacking actions (shots, corners, etc.).
The strongest relationships in the dataset, also unsurprising, are those between the different types of shots: the more a team shoots, the more shots end up both on and off target. It also wins more corners.
As a curiosity, practically the only noteworthy correlations with the season (`year`) are throw-ins, which decrease from one season to the next, and pass accuracy, which increases. This hints at the direction the game has been taking lately: more passing-based football with less play down the wings.
Goals, the variable we were most interested in, show a marked relationship only with shots on target.
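The two readings above (complementary possession and ranking variables by their correlation with goals) can be reproduced on a small simulated data set. Everything in this sketch is an assumption made for illustration: the column names, the seed and the way goals are generated from shots on target. It does not use the match data.

```r
set.seed(1)  # arbitrary seed, only for reproducibility
n <- 200
home_poss <- round(runif(n, 30, 70))
toy <- data.frame(
  home_poss = home_poss,
  away_poss = 100 - home_poss,        # the two shares add up to 100
  on_target = rpois(n, lambda = 5)
)
# Simulated goals: each shot on target is converted with probability 0.3
toy$goals <- rbinom(n, size = toy$on_target, prob = 0.3)

# Complementary percentages give a correlation of -1 (up to floating-point error)
poss_cor <- cor(toy$home_poss, toy$away_poss)

# Rank the remaining variables by absolute correlation with the target
r <- cor(toy)[, "goals"]
r <- r[names(r) != "goals"]
ranked <- r[order(abs(r), decreasing = TRUE)]
ranked
```

Applied to `num_data`, the same last three lines with `"Home_Team_Goals_Scored"` as the target would list the variables in the order discussed above.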
train %>% select("score", "jornada", "year", "Home_Team_Goals_Scored", "Home_Team_Possession_%", "Home_Team_Total_Shots", "Home_Team_On_Target_Shots") %>%
ggpairs(aes(color = score, alpha = .5))
Once again, shots on target is the variable most clearly associated with goals.
Below are some of the bar charts produced while looking for further relationships.
plot_grid(ggplot(train,aes(score , Home_Team_Goals_Scored)) + geom_bar(position = "dodge", stat = "summary", fun = "mean", color = 'black', fill = '#066699') + labs(title="Media de goles del equipo local por partido", x="resultado", y="goles") + theme(plot.title=element_text(hjust = 0.5)),
ggplot(train,aes(score , Home_Team_Total_Shots)) + geom_bar(position = "dodge", stat = "summary", fun = "mean", color = 'black', fill = '#066699') + labs(title="Media de tiros del equipo local por partido", x="resultado", y="disparos") + theme(plot.title=element_text(hjust = 0.5)),
ggplot(train,aes(score , Home_Team_On_Target_Shots)) + geom_bar(position = "dodge", stat = "summary", fun = "mean", color = 'black', fill = '#066699') + labs(title="Media de tiros a puerta del equipo local por partido", x="resultado", y="disparos a puerta") + theme(plot.title=element_text(hjust = 0.5)),
ggplot(train,aes(score , `Home_Team_Possession_%`)) + geom_bar(position = "dodge", stat = "summary", fun = "mean", color = 'black', fill = '#066699') + labs(title="Media de posesión del equipo local por partido", x="resultado", y="posesión (%)") + theme(plot.title=element_text(hjust = 0.5)),
align = "h", nrow = 2)
We also looked for patterns over time, by season and by matchday, but nothing relevant emerged.
ggplot(train, aes(Home_Team_Goals_Scored)) + geom_bar(fill="#d55566") +
facet_wrap(~ year, nrow=1)
ggplot(train, aes(Away_Team_Goals_Scored)) + geom_bar(fill="#066699") +
facet_wrap(~ year, nrow=1)
ggplot(train, aes(Home_Team_Goals_Scored)) + geom_bar(fill="#d55566") +
facet_wrap(~ jornada, ncol = 8)
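A numeric companion to these faceted bar charts is the mean number of goals per season, which makes a flat trend easy to spot. This is a minimal sketch on a made-up data frame: the values are invented and only the column names mirror `train`.

```r
# Toy data standing in for `train` (values are made up)
toy <- data.frame(
  year = c(2014, 2014, 2015, 2015, 2016, 2016),
  Home_Team_Goals_Scored = c(1, 3, 2, 2, 0, 4)
)

# Mean home goals per season
season_means <- aggregate(Home_Team_Goals_Scored ~ year, data = toy, FUN = mean)
season_means
```

In this toy example every season averages 2 goals, which is what "no pattern over time" would look like in a table. Replacing `year` with `jornada` gives the per-matchday version.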
After considerable thought, we concluded that predicting the number of goals a team scores requires defining new variables.
Recoding variables
# Match-level totals: home value + away value for each statistic
train <- train %>%
  mutate(
    goals = Home_Team_Goals_Scored + Away_Team_Goals_Scored,
    total_off_target_shots = Home_Team_Off_Target_Shots + Away_Team_Off_Target_Shots,
    total_on_target_shots = Home_Team_On_Target_Shots + Away_Team_On_Target_Shots,
    total_shots = Home_Team_Total_Shots + Away_Team_Total_Shots,
    total_blocked_shots = Home_Team_Blocked_Shots + Away_Team_Blocked_Shots,
    total_corners = Home_Team_Corners + Away_Team_Corners,
    total_throw_ins = Home_Team_Throw_Ins + Away_Team_Throw_Ins,
    # Sum of two percentages (range 0-200), not an average
    total_pass_success = `Home_Team_Pass_Success_%` + `Away_Team_Pass_Success_%`,
    total_aerials_won = Home_Team_Aerials_Won + Away_Team_Aerials_Won,
    total_clearances = Home_Team_Clearances + Away_Team_Clearances,
    total_fouls = Home_Team_Fouls + Away_Team_Fouls,
    total_yellow_cards = Home_Team_Yellow_Cards + Away_Team_Yellow_Cards,
    total_second_yellow_cards = Home_Team_Second_Yellow_Cards + Away_Team_Second_Yellow_Cards,
    total_red_cards = Home_Team_Red_Cards + Away_Team_Red_Cards
  )
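Because every new variable is the home value plus the away value of the same statistic, the recoding could also be written as a loop over the statistic names. The helper `add_totals` and the toy data below are assumptions made for illustration. Note that its automatic naming (`total_` plus the lower-cased suffix) would differ from the names used above for goals, total shots and pass success, so those would still need to be handled by hand.

```r
# Hypothetical helper: adds total_<stat> = Home_Team_<stat> + Away_Team_<stat>
add_totals <- function(data, stats) {
  for (s in stats) {
    data[[paste0("total_", tolower(s))]] <-
      data[[paste0("Home_Team_", s)]] + data[[paste0("Away_Team_", s)]]
  }
  data
}

# Toy data standing in for `train` (values are made up)
toy <- data.frame(Home_Team_Corners = c(5, 3), Away_Team_Corners = c(2, 4),
                  Home_Team_Fouls = c(10, 12), Away_Team_Fouls = c(14, 9))
toy <- add_totals(toy, c("Corners", "Fouls"))
toy[, c("total_corners", "total_fouls")]
```

The explicit version above is longer but keeps each variable name visible, which is arguably easier to audit in a report.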